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Abstract 

Log-density gradient estimation is a fundamental statistical problem and possesses 
various practical applications such as clustering and measuring non-Gaussianity. 

A naive two-step approach of first estimating the density and then taking its log- 
gradient is unreliable because an accurate density estimate does not necessarily 
lead to an accurate log-density gradient estimate. To cope with this problem, 
a method to directly estimate the log-density gradient without density estima¬ 
tion has been explored, and demonstrated to work much better than the two-step 
method. The objective of this paper is to further improve the performance of this 



direct method in multi-dimensional cases. Our idea is to regard the problem of 
log-density gradient estimation in each dimension as a task, and apply regularized 
multi-task learning to the direct log-density gradient estimator. We experimen¬ 
tally demonstrate the usefulness of the proposed multi-task method in log-density 
gradient estimation and mode-seeking clustering. 


1 Introduction 


Multi-task learning is a paradigm of machine learning for solving multiple related 
learning tasks simultaneously with the expectation that information brought by 


other related tasks can be mutually exploited to improve the accuracy Caruana 


1997 . Multi-task learning is particularly useful when one has many related learn¬ 


ing tasks to solve but only few training samples are available for each task, which 


is often the case in many real-world problems such as therapy screening Bickel 


et ah, 2008 and face verification Wang et al. 2009 . 


Multi-task learning has been gathering a great deal of attention, and extensive 


studies have been conducted both theoretically and experimentally Thrun 1996 


Evgeniou and Pontil 2004 Ando and Zhang 2005 Zhang, 2013 Baxter 2000 


Thrun 1996 proposed the lifelong learning framework, which transfers the knowl¬ 


edge obtained from the tasks experienced in the past to a newly given task, and it 


was demonstrated to improve the performance of image recognition. Baxter Bax¬ 


ter 


2000 defined a multi-task learning framework called inductive bias learning, 


and derived a generalization error bound. The semi-supervised multi-task learning 


method proposed by Ando and Zhang 2005 generates many auxiliary learning 
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tasks from unlabeled data and seeks a good feature mapping for the target learn¬ 
ing task. Among various methods of multi-task learning, one of the simplest and 
most practical approaches would be regularized multi-task learning |Evgeniou and 


Pontil, 2004, Evgeniou et al. 2005 , which uses a regularizer that imposes the so¬ 


lutions of related tasks to be close to each other. Thanks to its generic and simple 
formulation, regularized multi-task learning has been applied to various types of 
learning problems such as regression and classification |Evgeniou and Pontil, 2004, 


Evgeniou et al. 2005 . In this paper, we explore a novel application of regularized 


multi-task learning to the problem of log-density gradient estimation Beran, 1976 


Cox, 1985, Sasaki et al. 2014 . 


The goal of log-density gradient estimation is to estimate the gradient of the 
logarithm of an unknown probability density function using samples following it. 


Log-density gradient estimation has various applications such as clustering Fuku- 


naga and Hostetler 1975, Cheng[ 1995 Comaniciu and Meer, 2002 Sasaki et al. 


2014 , measuring non-Gaussianity Huber, 1985 and other fundamental statistical 


topics Singh, 1977 


Beran [19761 proposed a method for directly estimating gradients without going 


through density estimation, to which we refer as least-squares log-density gradients 
(LSLDG). This direct method was experimentally shown to outperform the naive 
one consisting of density estimation followed by log-gradient computation, and 


was demonstrated to be useful in clustering Sasaki et al., 2014 


The objective of this paper is to estimate log-density gradients further ac¬ 


curately in multi-dimensional cases, which is still a challenging topic even us- 


3 
























































ing LSLDG. It is important to note that since the output dimensionality of the 
log-density gradient Vlogp(:r) is the same as its input dimensionality d, multi¬ 
dimensional log-density gradient estimation can be regarded as having multiple 
learning tasks if we regard estimation of each output dimension as a task. Based 
on this view, in this paper, we propose to apply regularized multi-task learning 
to LSLDG. We also provide a practically useful design of parametric models for 
successfully applying regularized multi-task learning to log-density gradient es¬ 
timation. We experimentally demonstrate that the accuracy of LSLDG can be 
significantly improved by the proposed multi-task method in multi-dimensional 
log-density estimation problems and that a mode-seeking clustering method based 
on the proposed method outperforms other methods. 

The organization of this paper is as follows: In Section [2j we formulate the 
problem of log-density gradient estimation and review LSLDG. Section [3] reviews 
the core idea of regularized multi-task learning. Section [4] presents our proposed 
log-density gradient estimator and algorithms for computing the solution. In Sec¬ 
tion [5j we experimentally demonstrate that the proposed method performs well 
on both artificial and benchmark data. Application to mode-seeking clustering is 
given in Section [ 6 ] Section [ 7 ] concludes this paper with potential extensions of this 
work. 


2 Log-density gradient estimation 

In this section, we formulate the problem of log-density gradient estimation , and 
then review LSLDG. 
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(a) Density estimation (b) Log-density gradient estimation 

Figure 1: A comparison of two log-density gradient estimates based on density 
estimation. In (a), is a better estimate to the true density p than p \, while 
in (b), V log pi is a better estimate to the true log-density gradient Vlogp than 

Vlogp 2 - 

2.1 Problem formulation and a naive method 

Suppose that we are given a set of samples, {a?j}” =1 , which are independent and 
identically distributed from a probability distribution with unknown density p(x) 
on The problem is to estimate the gradient of the logarithm of the density 
p(x) from {ccj}f =1 : 

Vlogp(^) = (<9i logp(a;),..., 9 d logp(a;)) T = , • • •» 

where dj denotes the partial derivative operator d/dx^ for x = ... ,x^' ) ) T . 

A naive method for estimating the log-density gradient is to first estimate the 
probability density, which is performed by, e.g., kernel density estimation (KDE) 
as 
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where a > 0 denotes the Gaussian bandwidth, then to take the gradient of the 
logarithm of p{x) as 


dj logp(a;) = 


djP(x) 

p(x) 


However, this two-step method does not work well because an accurate density 
estimate does not necessarily provide an accurate log-density gradient estimate. 
For example, Figure [l] illustrates that a worse (or better) density estimate can 
produce a better (or worse) gradient estimate. 

To overcome this problem, LSLDG, a single-step method which directly esti¬ 


mates the gradient without going through density estimation, was proposed Be 


ran 


1976, Cox, 1985, Sasaki et al. 2014 , and has been demonstrated to experi¬ 


mentally work well. Next, we review LSLDG. 


2.2 Direct estimation of log-density gradients 


The basic idea of LSLDG is to directly fit a model g 3 {x) to the true log-density 
gradient djlogp(x ) under the squared loss: 


RjiSj) ■= / (9j(x) -djlogp(x)) 2 p(x)dx 


9j(x) 


djP(x) 

P(x) 


p(x)dx 


= I gj(x) 2 p(x)dx — 2 J gj(x)djp(x)dx + Cj 

= [ 9j(x) 2 p(x)dx -2 / [gj(x)p{x)] x ^ ) Z°^ 00 dx^ 3) + 2 [ d j g j (x)p(x)dx + Cj 


= / gj(x) 2 p(x)dx + 2 / djgj(x)p(x)dx + Cj, 
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where C 3 := f (djp(x)) 2 p(x)dx is a constant that does not depend on g 3 , 
f (-)dx^^ denotes integration except for x^\ and the last deformation comes from 
integration by parts under the mild condition that gj(x)p(x) —* 0 as \x^\ —> oo. 

Then, the LSLDG score Jj(g 3 ) is given as an empirical approximation to the 
risk Rj(gj) subtracted by Cf 

1 n g n 

jj(dj) ;= — 9j( x i ) + — yi dj9j( x i)- (i) 

i=i i=i 

As gj(x), a linear-in-parameter model is used: 

b 

9j (*) = 0]il>j (*) = 6 f'^ (*) > ( 2 ) 

fe=i 

where 6^ is a parameter, ipj k \x) is a differentiable basis function, and b is the 
number of the basis functions. By substituting Q into (JTj) and adding an i 2 - 
regularizer, we can analytically obtain the optimal solution 0 3 as 

0 3 = argmin 0 ; G ,()j + 2hj 0, + Aj||0j|| 2 ] 

— ~(Gj + l hj-> 


where Xj > 0 is the regularization parameter, is the b x b identity matrix, and 

^ n i n 

G J '■ h 3 '■= 


2=1 


n 


2=1 


Finally, an estimator of the log-density gradient is obtained by 

&(*) : = 

It was experimentally shown that LSLDG produces much more accurate esti¬ 
mates of log-density gradients than the KDE-based gradient estimator and that 


the clustering method based on LSLDG performs well Sasaki et al. 2014 
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3 Regularized multi-task learning 


In this section, we review a multi-task learning framework called regularized multi¬ 
task learning |Evgeniou and Pontil 2004, Evgeniou et al., 2005 , which is powerful 
and widely applicable to many machine learning methods. 

Consider that we have T tasks of supervised learning as follows. The task t 
is to learn an unknown function f*(x) from samples of input-output pairs 


{(xf\ where yp is the output f*(x) with noise at the input 


x = x „■ 


(*) 


When ft(x) is modeled by a parameterized function f t (x]G t ), learning is per¬ 
formed by finding the parameter 0 t which minimizes the empirical risk associated 
with some loss function l(y,y'): 


nt 


0 t = arg min — S' e t)) = argmin J t (0 t ), 

o t n f o t 


i=i 


where J t {6 t ) = E"=i Kv? , 0 t )). 

In regularized multi-task learning, the objective function has regularization 
terms which impose every pair of parameters to be close to each other while Jt(Ot) 
are jointly minimized: 

t=1 t=l,t'=l 

where 7 > 0 is the regularization parameter and 7 ttt > > 0 are the similarity pa¬ 
rameters between the tasks t and t'. 

It was experimentally demonstrated that the multi-task support vector regres¬ 


sion 


Evgeniou and Pontil, 2004, Evgeniou et al. 2005 , performs better than the 


single-task counterpart Vapnik et ah, 1997 especially when the tasks are highly 


related each other. 



























4 Proposed method 


In this section, we present our proposed method and algorithms. 

4.1 Basic idea 

Our goal in this paper is to improve the performance of LSLDG in multi¬ 
dimensional cases. For multi-dimensional input x, the log-density gradient 
Vlogp(a3) has multiple output dimensions, meaning that its estimation actu¬ 
ally consists of multiple learning tasks. Our basic idea is to apply regularized 
multi-task learning to solve these tasks simultaneously instead of learning them 
independently. 

This idea is supported by the fact that the target functions of these tasks, 
d\ logp(a;),..., dd logp(cc), are all derived from the same log-density logp(cc), and 
thus they must be strongly related to each other. Under such strong relatedness, 
jointly learning them with sharing information with each other would improve 
estimation accuracy as has been observed in other existing multi-task learning 
work. 

4.2 Regularized multi-task learning for least-squares log- 
density gradients (MT-LSLDG) 

Here, we propose a method called regularized multi-task learning for least-squares 
log-density gradients (MT-LSLDG). 

Our method MT-LSLDG is given by applying regularized multi-task learning 
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to LSLDG. Specifically, we consider the problem of minimizing the following 
objective function: 

d d i d 

J(0i ,..., Of) = Jj(Sj (S @j)) + A? Ill'll + 2 ^ Tw'll^j — fy'll 

j=i j=i j,j'=i 

d 1 d 

= ^ (0jG,-0,- + 20jhj + A i ||0 i || 2 ) + -7 ^2 ~ 0 i'll 2 ; ( 3 ) 

j=i 

where the last term is the multi-task regularizer which imposes the parameters 
close to each other. 

Denoting the nrinimizers of 0 by 0 u ...,e d , the estimator g(x) = 
(?i(*), • • • , 9 d{x)) T is given by 

9j( x ) = 9 j(x; Of) = Ojil>j(x). (4) 

We call this method regularized multi-task learning for least-squares log-density 
gradients (MT-LSLDG). 

4.3 The design of the basis functions 

The design of the basis functions ifj(x) in MT-LSLDG is crucial to enjoy the 
advantage of regularized multi-task learning. A simple design would be to use 
a common function <f>(x) = (<//b(cc),..., ^ b \x)) to all x), that is, ifi(x) = 

• • • = ifd{x) = <fi(x). From 0 and Q, in this design, the multi-task regularizer 
promotes 'g J (x: Of) to be more close to each other so that 

?i(*;0i) ~ ~ 9d{x\d d ). 

However, it is inappropriate that all 'gfx'. Of are similar because the different 
true partial derivatives, say djlogp(x) and dp logp(x) for j f j', show different 
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profiles in general. 

To avoid this problem, we propose to use the partial derivatives of <f>(x) as 
basis functions: 


^f\x) = dj<t>(x) 


For this basis function design, it holds that 

9j(x) = Oj il>j(x) = 0 ] dj4>{x ) = djdJtf>(x) « dj log p(x). 

Thus, 6 • (pix) are approximations of the true log-density logp(cc). Since the multi¬ 
task regularizer encourages the log-density estimates 0j<p(x) to be more similar, 
this basis design would be reasonable. 

As a specific choice of ( fi( k \x ), we use a Gaussian kernel: 


ipj k \x) = dj exp 


x - c k 


2a] 


C U) _ x (j) 


exp 


\x - c k | 

2 <7 | 


where c k are the centers of the kernels, and a. > 0 are the Gaussian band width 


parameters. 


4.4 Hyper-parameter selection by cross-validation 

As in LSLDG, the hyper parameters, which are the G-regularization parame¬ 
ters A j, the Gaussian width aj, and the multi-task parameters 7,7j,y, can be 
cross-validated in MT-LSLDG. The procedure of the Jl-fold cross-validation is as 
follows: First, we randomly partition the set of training samples S tr into K folds 
F\ ,..., Fk- Next, for each k = 1,..., K, we estimate the log-density gradient 
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using the samples in S tr \ F k , which is denoted by g^\ and then calculate the 
LSLDG scores for the samples in F k as J^y: 


j(k) _ v- ? (, 

cv “ \F k \ ^ 9j 

1 1 xeF k 


1 


«^) 2 + - 2 


E 

x£F k 


dx^ 


We average these LSLDG scores to obtain the ib-fold cross-validated LSLDG score: 


K 


J cv-^^ J cv- 


k= 1 

Finally, we choose the hyper-parameters that minimize Jcv. Throughout this 
paper, we set K = 5. 


4.5 Optimization algorithms in MT-LSLDG 

Here, we develop two algorithms for minimizing ([3]). One algorithm is to directly 
evaluate the analytic solution and the other is an iterative method based on block 


coordinate descent Warga, 1963 


4.5.1 Analytic solution 

For simplicity, we assume the similarity parameters are symmetric: 7 jjt = 7 yj. 
Then, the objective function J(6 1 ,..., Q c i) can be expressed as a quadratic function 
in terms of 6 = (6j ,..., 0j) T as 

J{0) = 6 t {G + C <g> I b )9 + 2 G T h, 

where 


G = diag(G 1 ,...,G d ), h = (hJ,...,h T d ) T , 

( d d 

^2 • • • , ^2 Tdj ) -jr, 

i=i i=i 
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[T]jji = 7 jji, diag(-, ...,•) is a block-diagonal matrix whose diagonal blocks are 
its arguments, and <E> denotes the Kronecker product. The minimizer 6 of J(0) is 
analytically computed by 


6 = arg rrhn ,1(0) = — (G + C <S> h) l h. 


(5) 


4.5.2 Block coordinate descent (BCD) method 

Direct computation of the analytic solution (J5| involves inversion of a db x db 
matrix. This may be not only expensive in terms of computation time but also 
infeasible in terms of memory space when the dimensionality d is very large. 

Alternatively, we propose an algorithm based on block coordinate descent 
(BCD) |Warga, 19631. It is an iterative algorithm which only needs manipulation 
of a relatively small bxb matrix at each iteration. This alleviates the memory size 
requirement and hopefully reduces computation time if the number of iterations 
is not large. 

A pseudo code of the algorithm is shown in Algorithm [!} At each update ([ 6 ]) 
in the algorithm, only one vector 6j is optimized in a closed-form while fixing 
the other parameters (/ ^ j). The update ([ 6 ]) only requires computing the 
inverse of a bxb matrix, which seems to be computationally advantageous over 
evaluating the analytic solution in terms of the computation cost and memory size 
requirement. 

Another important technique to reduce the overall computation time is to use 
warm start initialization: when the optimal value of 7 is searched for by cross- 
validation, we may use the solutions 0i,...,Od obtained with 7 as initial values 
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Algorithm 1 Block coordinate descent (BCD) algorithm. 
Initialize G ±,..., G d . 

repeat 

for j — 1 ,. . > , d do 


Oj e- arg min J(G 1 ,..., Gj- i, 6 »,. 0 i+ i, ...,G d ) 

0 i 


( Gj + AjJf, + 27 Ijj'Ib j | hj + 27 7 j,j'Qj 

\ rff J \ j'& 


end for 


until Gi,... ,G d converge. 


for another 7 . 


5 Experiments on log-density gradient estima¬ 
tion 

In this section, we illustrate the behavior of the proposed method and experimen¬ 
tally investigate its performance. 


5.1 Experimental setting 


In each experiment, training samples {ay}” =1 and test samples {cc(, are drawn 
independently from an unknown density p(x). We estimate Vlogp(a;) from the 
training samples, and then evaluate the estimation performance by the test score 


Jte{g) — ^ n , y 9 ]( x v) + n , y^ djgjjxy, 
7=1 . i'=\ i '=1 
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where g{x) = (gi(x ),... ,'g d (x)) T is an estimated log-density gradient. This score 
is an empirical approximation of the expected squared loss of g(x) over the test 


samples without the constant C 3 (see Section 2.2), and a smaller score means a 
better estimate. 

We compare the following three methods: 

• The multi-task LSLDG (MT-LSLDG): our method proposed in Section [lj 


The single-task LSLDG (S-LSLDG): the existing method Beran, 1976, Cox 


1985 reviewed in Section 2.2. This method agrees with MT-LSLDG at 


7 = 0 . 


• The common-parameter LSLDG (C-LSLDG): LSLDG with common param¬ 
eters O' — 0\ — ■ ■ • = 6d learned simultaneously. The solution is given as 

d d 


S' 


arg min 
e' 


0 ,T Y Gj6' + 2 Y h l 6 ' + A ll 6,/ f 

3 =1 3 =1 


( d \ ld 

3 =1 / 3 =1 

where A > 0 is the £ 2 - re g u larization parameter. This method agrees with 
MT-LSLDG at the limit 7 —>■ 00 . 


In all the methods, we set the number of basis functions as b = min{50, n}, 
and randomly choose the kernel centers Ci,..., Cj, from training samples {Xi}\ 1 =l . 
For hyper-parameters, we use the common ^-regularization parameter A and 
bandwidth parameter a among all the dimensions, Ai = • • • = A^ = A and 
ui = ■■■ = cr d = a. We also set all the similarity parameters as 7 ^ = 1, 
which assumes that all dimensions are equally related to each other. 
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5.2 Artificial data 


We conduct numerical experiments on artificial data to investigate the basic be¬ 
havior of MT-LSLDG. As data density p(x), we consider the following two cases: 

• Single Gaussian: The d -dimensional Gaussian density whose mean is 0 and 
whose covariance matrix is the diagonal matrix with the first half of the 
diagonal elements are 1 and the others are 5. 

• Double Gaussian: A mixture of two d-dimensional Gaussian densities with 
mean zero and (5, 0,, 0) T and identity covariance matrix. The mixing 
coefficients are 1/2. 

The dimensionality d and sample size n are specified later. 

First, we investigate whether MT-LSLDG improves the estimation accuracy 
of LSLDG at appropriate 7. We prepare datasets with different dimensionali¬ 
ties d = 2,10,20 and sample sizes n = 10,30,50. MT-LSLDG is applied to the 
datasets at each 7 e (0, 0.1, 0.25, 0.5,1, 2.5, 5,10, 00 }. The Gaussian bandwidth 
a and the ^-regularization parameter A are chosen by 5-fold cross-validation as 
described in Section [d] from the candidate lists {10” 1 ,10 -0 25 ,10 0 5 ,10 125 ,10 2 } and 
{10 —2 ,10 -125 ,10°- 5 ,10 025 ,10 1 }, respectively. The solution of MT-LSLDG is com¬ 
puted analytically as in (J5]) . 

The results are plotted in Figure [2} In the figure, the relative test score is 
defined as the test score from which the test score of S-LSLDG is subtracted, and 
thus negative relative scores indicate that MT-LSLDG improved the performance 
of S-LSLDG. When d = 2, MT-LSLDG does not improve the performance for 
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(a) Single Gaussian (n = 30) 


(b) Double Gaussian (n = 30) 



(c) Single Gaussian (d = 10) 


(d) Double Gaussian (d = 10) 


Figure 2 : Average (and standard errors) of relative test scores over 100 runs. The 
relative test scores refer to test scores from which the test score of S-LSLDG is 
subtracted. The black dotted lines indicate the relative score zero. 


any 7 values (Figure 2(a) and 2(b)). However, for higher-dimensional data, the 


performance is improved at appropriate 7 values (e.g., 7 = 0.5 for d = 20 in 


Figure 2(a) and 7 = 2.5 for d — 20 in Figure 2(b)). Similar improvement is 


observed also for smaller sample size (e.g., n — 10 and n = 30) in Figure 2(c) and 


Figure 2(d) 


These results confirm that MT-LSLDG improves the performance of S-LSLDG 


at an appropriate 7 value when data is relatively high-dimensional and the sample 
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size is small. Since such 7 is usually unknown in advance, we need to find a 
reasonable value in practice. 

Next, we investigate whether an appropriate 7 value can be chosen by cross¬ 


validation. In this experiment, the cross-validation method in Section 4 A is per¬ 
formed to choose 7 as well. The candidates of 7 is {0, 0.1, 0.25, 0.5,1, 2.5, 5,10, 00 }. 
The other experimental settings such as the data generation and all the LSLDGs 
are the same as in the last experiment. 

Table [T| shows that MT-LSLDG improves the performance especially when 
the dimensionality of data is relatively high and the sample size is small. These 
results indicate that the proposed cross-validation method allows us to choose a 
reasonable 7 value. 


5.3 Benchmark data 


In this section, we demonstrate the usefulness of MT-LSLDG in gradient estima¬ 
tion on various benchmark datasets. 


This experiment uses IDA benchmark repository |Ratsch et ah, 200l]. For 
MT-LSLDG, the hyper-parameters cr, A and 7 are chosen by cross-validation. The 
candidate lists are a £ {0.1, 0.25, 0.5,1, 2.5, 5,10}, A G {10 — 5 ,10 -4 ,..., 10“ 1 } and 
7 £ {0,10~ 5 ,10 -4 ,..., 10 1 ,10 2 , 00 }, respectively. For S-LSLDG and C-LSLDG, 
the candidate lists of a and A are the same as MT-LSLDG. The solution of 


MT-LSLDG is computed by the BCD algorithm described in Section 4.5.2 


The results are presented in Table [2] MT-LSLDG significantly improves the 
performance of either S-LSLDG or C-LSLDG on most of the datasets. 
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Table 1: Averages (and standard errors) of test scores on the artificial data with 
cross-validation over 100 runs. In each row, the best and comparable to the best 
scores in terms of paired t-test with significance level 5% are emphasized in bold 
face. 


Density 

n 

MT-LSLDG 

S-LSLDG 

C-LSLDG 

Single Gaussian 

10 

-2.87 (0.22) 

0.37 (0.31) 

-2.58 (0.07) 


30 

-5.34 (0.038) 

-4.97 (0.08) 

-3.29 (0.03) 


50 

-5.63 (0.02) 

-5.55 (0.02) 

-4.13 (0.04) 

Double Gaussian 

10 

-6.83 (0.14) 

1.01 (0.54) 

-5.02 (0.12) 


30 

-8.45 (0.03) 

-7.63 (0.10) 

-7.84 (0.04) 


50 

-8.67 (0.02) 

-8.29 (0.10) 

-8.48 (0.02) 


Density 

d 

MT-LSLDG 

S-LSLDG 

C-LSLDG 

Single Gaussian 

2 

0.20 (0.19) 

-0.11 (0.15) 

0.09 (0.15) 


10 

-5.34 (0.04) 

-4.97 (0.08) 

-3.29 (0.03) 


20 

-10.77 (0.03) 

-9.98 (0.13) 

-6.39 (0.01) 

Double Gaussian 

2 

0.54 (0.22) 

0.50 (0.27) 

0.19 (0.22) 


10 

-8.45 (0.03) 

-7.63 (0.10) 

-7.84 (0.04) 


20 

-16.9 (0.14) 

-14.90 (0.10) 

-15.26 (0.06) 
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Table 2: Avarages (and standard errors) of the test scores on IDA datasets. In each dataset, the best and comparable to the best 
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6 Application to mode-seeking clustering 


In this section, we apply MT-LSLDG to mode-seeking clustering and experimen¬ 
tally demonstrate its usefulness. 

6.1 Mode-seeking clustering 

A practical application of log-density gradient estimation is mode-seeking clus¬ 
tering Fukunaga and Hostetler 1975, Cheng 1995 Comaniciu and Meer [2002 


Sasaki et ah, 2014 . Mode-seeking clustering methods update each data point 


toward a nearby mode by gradient ascent, and assign the same clustering label 
to the data points which converged to the same mode (Figure [3]). Their notable 
advantage is that we need not specify the number of clusters in advance. Mode¬ 
seeking clustering has been successfully applied to a variety of real world problems 
such as object tracking |Comaniciu et ah 2000 , image segmentation Comaniciu 


and Meer, 2002, Sasaki et ah, 2014 , and line edge detection in images Bandera 


et al. 2006 . 


In mode-seeking, the essential ingredient is the gradient of the data density 
To estimate the gradients, mean shift clustering |Fukunaga and Hostetler, 1975 


Cheng, 1995, Comaniciu and Meer, 2002 , which is one of the most popular mode¬ 


seeking clustering methods, employs the two-step method of first estimating the 
data density by kernel density estimation and then taking its gradient. However, 
as we mentioned earlier, this two-step method does not work well since accurately 
estimating the density does not necessarily lead to an accurate estimate of the 


gradient. 
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Figure 3: Transition of data points during a mode-seeking process. Data samples 
are drawn from a mixture of Gaussians, and the data points sampled from the 
same Gaussian component are specified by the same color (red, green, or blue) 
and marker (plus symbol, circle, or triangle). White squares indicate the points 
to which data points converged. 


22 





In order to overcome this problem, LSLDG clustering [Sasaki et al., 2014 


adopted LSLDG instead of the two-step method. Sasaki et al. 2014 also pro¬ 


vided a practically useful fixed-point algorithm for mode-seeking as in mean shift 


clustering Cheng, 1995 : When the partial derivative of a vector of Gaussian 


kernels xj)j(x) = dj<p(x ) is used as the vector of basis functions, the model 
gj(x) = 0- xf)j(x ) can be transformed as 


b „0) ,y.(j) 

- 2W c k~ x 

k=1 

b 


9j( x ) = 


cr z 


(fi k \x) 


i “ (i) b 


k =1 


1 


cr z 




k= 1 


ELi 

eLi^V^o*) 


x 


O') 


where we assume that ^ Sfc=i (*) is nonzero. Setting gj(x) to zero yields 


CT 2 Z^fc=i r 

a fixed-point update formula as 


e- 




It has been experimentally shown that LSLDG clustering performs significantly 


better than mean-shift clustering Sasaki et al. 2014 . 


Here, we apply MT-LSLDG to LSLDG clustering and investigate if the per¬ 
formance is improved in mode-seeking clustering as well for relatively high¬ 


dimensional data. 


6.2 Experiments 

Next, we conduct numerical experiments for mode-seeking clustering. 
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6.2.1 Experimental setting 


We apply the following four clustering methods to various datasets: 


MT-LSLDGC: LSLDG clustering with MT-LSLDG. 


S-LSLDGC: LSLDG clustering with S-LSLDG Sasaki et al. 2014 


C-LSLDGC: LSLDG clustering with C-LSLDG Sasaki et al. 2014 . 


Mean-shift: mean shift clustering |Comaniciu and Meer, 2002 


For MTL-, S-, and C-LSLDG, all the hyper-parameters are cross-validated as 


described in Section 44, and for mean-shift, log-likelihood cross-validation is used. 
We evaluate the clustering performance by the adjusted Rand index (ARI) (Hu¬ 


bert and Arabie, 1985 . ARI gives one to the perfect clustering assignment and 


zero on average to a random clustering assignment. A larger ARI value means a 


better clustering result. 


6.2.2 Artificial data 

First, we conduct experiments on artificial data. The density of the artificial data 
is a mixture of three d-dimensional Gaussian densities with means (0, 2, 0,..., 0), 
(— 2 , — 2 , 0,..., 0), and ( 2 , — 2 , 0,..., 0), covariance matrices -J=Id , and mixing co¬ 
efficients 0.4,0.3,0.3. The candidate lists of the hyper-parameters are the following: 
a G {10- 1 , IO” 7 / 9 , IO- 5 / 9 ,..., IO 5 / 9 , IO 7 / 9 ,10 1 }, A G {10~ 5 ,10~ 4 ,10~ 3 ,10~ 2 ,10- 1 } 
and, 7 G {0,10“ 5 ,10“ 4 , IO" 3 , IO" 2 ,10' 1 ,10°, 10 1 ,10 2 , 00 }. 

The results are shown in Table [3j We can see that MT-LSLDGC performs well 
especially for the largest dimensionality d — 20 . 
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Tabic 3: Averages (and standard errors) of ARIs on real data. In each row, the 
best and comparable to the best ARI in terms of unpaired t-test with significance 
level 5% is emphasized in bold face. The number of trials is 100. 


d MT-LSLDGC 
2 0.992 (0.035) 

10 0.993 (0.004) 

15 0.983 (0.023) 
20 0.827 (0.190) 


S-LSLDGC 
0.973 (0.125) 
0.994 (0.003) 
0.982 (0.054) 
0.586 (0.208) 


C-LSLDGC 
0.992 (0.036) 
0.994 (0.004) 
0.877 (0.217) 
0.716 (0.352) 


Mean-shift 
0.984 (0.044) 
0.042 (0.022) 
0.000 ( 0 . 000 ) 
0.036 (0.037) 


6.2.3 Real data 


Next, we perform clustering on real data. The following three datasets are used: 


Accclerometry data: 5-dimensional data used in Hachiya et ah, 2012 for hu¬ 


man activity recognition extracted from mobile sensing data available from 
http://alkan.rans.kyutech.ac.jp/web/data. The number of classes is 3. 
In each run of experiment, we use randomly chosen 100 samples from each 
class. The total number of samples is 300. 


• Vowel data: 10-dimensional data of recorded British English vowel 
sounds available from https://archive.ics.uci.edu/ml/datasets/ 
Connectionist+Bench+(Vowel+Recognition+-+Deterding+Data), The 
number of classes is 11 In each run of experiment, we use randomly chosen 
500. 


• Sat-image data: 36-dimensional multi-spectral satellite image available 
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from https://archive.ics.uci.edu/ml/datasets/Statlog+(Landsat+ 
Satellite). The number of classes is 6. In each run of experiment, we 
use randomly chosen 2000 samples. 


Speech data: 50-dimensional voice data by two French speakers Sugiyama 


et ah, 2014 . The number of classes is 2. In each run of experiment, we use 


randomly chosen 200 samples from each class. The total number of samples 


is 400. 


For MT-LSLDG, the hyper-parameters are cross-validated us¬ 
ing the candidates, a G {1CT 1 ,1CT 6 / 9 ,10 3//9 ,..., 10 12 / 9 ,10 15 / 9 ,10 2 }, 
A G {10~ 5 ,10~ 4 ,10“ 3 ,10~ 2 , lO” 1 ,10°} and 7 G {10“ 5 ,10" 4 ,..., 10 1 ,10 2 }, 
except that we use relatively small candidate lists a G {0.5,1,2.5,5,10}, 
A G {0.003,0.01,0.1,1} and 7 G {0.1,1,10} for the speech data since it has 
large dimensionality and optimization is computationally expensive. For S- and 
C-LSLDG, we used the same candidates of MT-LSLDG for a and A. For mean 
shift clustering, the Gaussian kernel is employed in KDE, and the bandwidth 
parameter in the kernel is selected by 5-fold cross-validation with respect to the 
log-likelihood of the density estimate from the same candidates of MT-LSLDG 
for a. 

The results are shown in Table [4j For the accelerometry data whose dimen¬ 
sionality is only five, S-LSLDGC gives the best performance and MT-LSLDGC 
does not improve the performance, although MT-LSLDGC performs better than 
C-LSLDGC. 
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Table 4: Averages (and standard errors) of ARIs on real data. In each row, the best 
and comparable to the best ARI in terms of paired t-test with significance level 
5% is emphasized in bold face. The number of trials is 100 for the accelerometry 
data and the sat-image data, and is 20 for the speech data. 


dataset (d, n) 

MT-LSLDGC 

S-LSLDGC 

C-LSLDGC 

Mean-shift 

accelerometry (5, 300) 

0.40 (0.01) 

0.53 (0.02) 

0.24 (0.01) 

0.26 (0.04) 

vowel (10, 500) 

0.15 (0.00) 

0.15 (0.00) 

0.15 (0.00) 

0.04 (0.00) 

sat-image (36, 2000) 

0.48 (0.00) 

0.43 (0.01) 

0.35 (0.00) 

0.00 (0.00) 

speech (50, 400) 

0.17 (0.02) 

0.00 (0.00) 

0.15 (0.01) 

0.00 (0.00) 


On the other hand, for the higher-dimensional dataset, the vowel data, the 
sat-image data, and the speech data, the performance of MT-LSLDGC is the best 
or comparable to the best. These results indicate that MT-LSLDG is a promising 
method in mode-seeking clustering especially when the dimensionality of data is 
relatively large. 

7 Conclusion 

We proposed a multi-task log-density gradient estimator in order to improve the 
existing estimator in higher-dimensional cases. Our fundamental idea is to exploit 
the relatedness inhering in the partial derivatives through regularized multi-task 
learning. As a result, we experimentally confirmed that our method significantly 
improves the accuracy of log-density gradient estimation. Finally, we demon¬ 
strated its usefulness of the proposed log-density gradient estimator in mode- 
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seeking clustering. 


Log-density gradient estimation would be useful in a measure for non- 
Gaussianity ffuber, 1985 and other further fundamental statistical topics |Singh, 


1977 . In the future work, we will investigate the performance of our proposed 


method in these topics. 
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